Add WP_XML_Tag_Processor and WP_XML_Processor #6713

Open. Wants to merge 62 commits into base: trunk
Conversation

adamziel
Contributor

@adamziel adamziel commented Jun 3, 2024

Trac Ticket: Core-61365

Important

This PR remains open but is not being updated. The XML parser is actively evolving in the WordPress-Playground repository. If you'd like to contribute to the XML API, please do so there. Once the API matures, this PR will be updated and reviewed for inclusion in WordPress core.

What

Proposes an XML Tag Processor and an XML Processor to complement the HTML Tag Processor and the HTML Processor.

The XML API implements a subset of the XML 1.0 specification and supports documents with the following characteristics:

  • XML 1.0
  • Well-formed
  • UTF-8 encoded
  • Not standalone (so can use external entities)
  • No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections

The API and ideas closely follow the HTML API implementation. The parser is streaming in nature, has a minimal memory footprint, and leaves unmodified markup as it was originally found.

It also supports streaming large XML documents; see adamziel#43 for a proof of concept.

Design decisions

Ampersand handling in text and attribute values

XML attribute values cannot contain the raw characters < or &.

Enforcing < is fast and easy. Enforcing & is slow and complex, because ampersands are allowed when they start an entity reference. This means we'd have to decode all entities as we scan through the document, which doesn't seem worth it.

Right now, the WP_XML_Tag_Processor only bails out when attempting to explicitly access an attribute value or text containing an invalid entity reference.
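The cheap `<` check described above can be a single byte scan. A minimal sketch (illustrative only; the function name is hypothetical, not part of this PR):

```php
<?php
// Hypothetical helper illustrating the trade-off above: rejecting a raw
// "<" in an attribute value is a single scan, while a full "&" check
// would require decoding every entity reference in the value.
function xml_attribute_value_contains_lt( string $value ): bool {
	// strcspn() returns the length of the prefix containing no "<";
	// if it's shorter than the value, a "<" is present.
	return strcspn( $value, '<' ) !== strlen( $value );
}

var_dump( xml_attribute_value_contains_lt( 'a &amp; b' ) ); // bool(false)
var_dump( xml_attribute_value_contains_lt( 'a < b' ) );     // bool(true)
```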

Accepting all byte sequences up to < as text data

XML spec defines text data as follows:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Currently, WP_XML_Tag_Processor does not attempt to reject bytes outside these ranges. It treats everything between valid elements as text data. On the upside, we avoid an expensive preg_match() call for processing text. On the downside, we won't bail out early when processing a malformed, truncated, or non-UTF-8 document.

I think it's a good trade-off to take, but I'm also happy to discuss.
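The trade-off can be illustrated with a minimal sketch (hypothetical helper names, not this PR's code): the cheap scan is what the processor effectively does, while the regex shows the Unicode-aware validation being avoided.

```php
<?php
// Cheap: everything up to the next "<" is treated as text data.
function next_text_span( string $xml, int $at ): int {
	return strcspn( $xml, '<', $at );
}

// Expensive (and currently avoided): validate the XML 1.0 Char
// production with a Unicode-aware regular expression.
function is_valid_xml_text( string $text ): bool {
	return 1 === preg_match(
		'/^[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/u',
		$text
	);
}
```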

Entity decoding

This API only decodes the following entities:

  • The five character references mandated by the XML specification: &amp;, &lt;, &gt;, &quot;, and &apos;
  • Numeric character references, e.g. &#123; or &#x1A;

Other entities trigger a parse error. This API could use the full HTML decoder, but it would be wrong and slow here. If lack of support for other entities ever becomes a problem, html_entity_decode() would be a good replacement for the custom implementation. It's worth noting that in XML1 mode it doesn't decode many HTML entities:

```
> html_entity_decode( '&oacute;' );
string(2) "ó"
> html_entity_decode( '&oacute;', ENT_XML1, 'UTF-8' );
string(8) "&oacute;"
```
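The supported subset above can be sketched as a small standalone decoder. This is illustrative only (it assumes the mbstring extension and is not the PR's implementation):

```php
<?php
// Illustrative sketch of the decoding rules above: only the five
// predefined entities and numeric character references are decoded;
// anything else is treated as a parse error (returns false).
function xml_decode_entity( string $ref ) {
	static $named = array(
		'&amp;'  => '&',
		'&lt;'   => '<',
		'&gt;'   => '>',
		'&quot;' => '"',
		'&apos;' => "'",
	);

	if ( isset( $named[ $ref ] ) ) {
		return $named[ $ref ];
	}

	// Numeric references: &#123; (decimal) or &#x1A; (hex, lowercase "x"
	// only, per the XML spec).
	if ( 1 === preg_match( '/^&#(?:x([0-9a-fA-F]+)|([0-9]+));$/', $ref, $m ) ) {
		$code_point = '' !== $m[1] ? hexdec( $m[1] ) : (int) $m[2];
		return mb_chr( $code_point, 'UTF-8' );
	}

	return false; // Parse error: unsupported entity.
}
```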

Follow-up work

  • Usage examples
  • Document the WP_XML_Processor class as thoroughly as the WP_HTML_Processor class is documented today.
  • Add a ton more unit tests for other compliance and failure modes.

Out of scope for the foreseeable future

cc @dmsnell @sirreal


github-actions bot commented Jun 3, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props zieladam, dmsnell, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.


github-actions bot commented Jun 3, 2024

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

```
 * | Context   | Meaning                                      |
 * |-----------|----------------------------------------------|
 * | *Prolog*  | The parser is parsing the prolog.            |
 * | *Element* | The parser is parsing the root element.      |
 * | *Misc*    | The parser is parsing miscellaneous content. |
```
Member

It might be helpful to start without an XML Tag Processor and go straight to the XML Processor, following the design of the HTML Processor with its step() and step_in_body() following the different contexts of the HTML API spec.

Prolog, for example, sounds an awful lot like the IN PROLOG context, where the rules can be contained and indicate a state transition rather than embedding the rules for staging in the lower-level tag processing.

Contributor Author

Yup, good call

Member

Shipping only an XML Processor was quite tedious

out of curiosity, what was tedious about it? I have a hard time understanding the need for an XML tag processor, since the nesting rules are straightforward when compared with HTML, where the complicated semantic rules forced a split

Contributor Author

I found it difficult to:

  • Think about two different levels of parsing (tokenization, context) in a single class
  • Avoid intertwining them
  • Come up with names for all the methods

That being said, there's nothing inherently wrong with the approach and I can see how it could work.

```php
 * We limit this scan to 30 characters, which allows twenty zeros at the front.
 */
30
);
```
Member

in this case I don't think we want a limit on these. for one, XML requires the trailing semicolon, and for two, we don't want to add rules to the spec and accidentally let something slip in, for example, by prepending a long reference in the front.

I can clean this up here, but I think it's important we not mush together all of the characters. something like this…

```php
if ( '#' === $text[ $next_character_reference_at + 1 ] ) {
	$is_hex = 'x' === $text[ $next_character_reference_at + 2 ] || 'X' === $text[ … ];
	// Digits start after "&#" (decimal) or "&#x" (hex).
	$zeros_start_at  = $next_character_reference_at + 2 + ( $is_hex ? 1 : 0 );
	$zeros_length    = strspn( $text, '0', $zeros_start_at );
	$digits_start_at = $zeros_start_at + $zeros_length;
	$digit_chars     = $is_hex ? '0123456789abcdefABCDEF' : '0123456789';
	$digits_length   = strspn( $text, $digit_chars, $digits_start_at );
	$semicolon_at    = $digits_start_at + $digits_length;

	// Must be followed by a semicolon.
	if ( $semicolon_at >= strlen( $text ) || ';' !== $text[ $semicolon_at ] ) {
		return false;
	}

	/*
	 * After skipping leading zeros there must be at least one digit
	 * remaining; this also rejects NULL, which cannot be encoded in XML.
	 */
	if ( 0 === $digits_length ) {
		return false;
	}

	/*
	 * Must encode a valid Unicode code point.
	 * (Avoid parsing more than is necessary).
	 */
	$max_digits = $is_hex ? 6 : 7;
	if ( $digits_length > $max_digits ) {
		return false;
	}

	$base       = $is_hex ? 16 : 10;
	$code_point = intval( substr( $text, $digits_start_at, $digits_length ), $base );
	if ( is_allowable_code_point( $code_point ) ) {
		$decoded .= WP_HTML_Decoder::code_point_to_utf8_bytes( $code_point );
		continue;
	}

	return false;
}

// Must be a named character reference.
$name_starts_at = $next_character_reference_at + 1;

$standard_entities = array(
	'amp;'  => '&',
	'apos;' => "'",
	'gt;'   => '>',
	'lt;'   => '<',
	'quot;' => '"',
);

foreach ( $standard_entities as $name => $replacement ) {
	// substr_compare() returns 0 when the strings match.
	if ( 0 === substr_compare( $text, $name, $name_starts_at, strlen( $name ) ) ) {
		$decoded .= $replacement;
		continue 2; // Continue the outer decoding loop.
	}
}
```

granted, we want to perform all length checks to avoid crashing, but I think we have to scan the entire thing to avoid mis-parsing and security issues

Contributor Author

@adamziel adamziel Jun 4, 2024

Oh, the decoder won't just skip over the 30 bytes when it stumbles upon &#x000000000000000000000000000000000000000000000000000000000000000000000123. If there's no valid reference within those 30 bytes, that's a parse failure and we halt. Still, I think you're right and it wouldn't be that bad. If someone loaded 1GB of data into memory, they must be expecting the parser will scan through it eventually.

@sirreal
Member

sirreal commented Jun 5, 2024

Have you seen the Extensible Markup Language (XML) Conformance Test Suites? It may be a helpful resource to find a lot of test cases, although after a quick scan many cases seem to violate "No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections"

… the MISC context where whitespace text nodes are always complete and non-whitespace ones are syntax errors
@adamziel
Contributor Author

adamziel commented Jun 6, 2024

Have you seen the Extensible Markup Language (XML) Conformance Test Suites? It may be a helpful resource to find a lot of test cases, although after a quick scan many cases seem to violate "No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections"

I haven't seen that one, thank you for sharing! With everything that's going on in Playground I may not be able to iron out all the remaining nuances here – would you like to take over at one point once XML parsing becomes a priority?

```php
 *
 * @var bool
 */
protected $is_incomplete_text_node = false;
```
Member

this name confused me. I think something like $inside_document_trailing_whitespace could be clearer, but I could still be confused

@dmsnell
Member

dmsnell commented Jun 6, 2024

Remember too that text appears in XML outside of CDATA sections. Any text content between tags is still part of the inner text. CDATA is there to allow raw content without escaping it. For example, it's possible to embed HTML as-is inside CDATA whereas outside of it, all of the angle brackets and ampersands would need to be escaped.

In effect, CDATA could be viewed transparently in the decoder as part of any other text segment, where the only difference is that CDATA decoding is different than normal text decoding.

Probably need to examine the comment parser because XML cannot contain -- inside a comment. What @sirreal recommended about the test suite is a good idea - it would uncover all sorts of differences with HTML. Altogether though I don't think WordPress needs or wants a fully-spec-compliant parser. We want one that addresses our needs, which is a common type of incorrectly-generated XML. Maybe we could have a parser mode to reject all errors, but leave it off by default.
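The comment rule mentioned above can be sketched as a standalone check (illustrative only; the function name is hypothetical). Per the XML 1.0 grammar, `--` may not appear anywhere inside a comment, and the comment text may not end with `-` (which would produce `--->`):

```php
<?php
// Sketch of the XML comment rule: Comment ::= '<!--' ((Char - '-') |
// ('-' (Char - '-')))* '-->', i.e. no "--" inside, no trailing "-".
function is_valid_xml_comment_text( string $text ): bool {
	return false === strpos( $text, '--' )
		&& '-' !== substr( $text, -1 );
}
```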

@adamziel
Contributor Author

adamziel commented Jun 6, 2024

@dmsnell Text is supported, or at least should be – is there a specific scenario that this PR doesn't handle today?

@dmsnell
Member

dmsnell commented Jun 6, 2024

@adamziel it was specifically your get_inner_text() replacement above

@adamziel
Contributor Author

adamziel commented Jun 6, 2024

Oh, that replacement is just overly selective; it can target text nodes, too.

By the way, there's no easy way of setting text in a tag without any text nodes. I'm noodling on emitting an empty text node, but haven't yet found a good way of doing it. Another idea would be to couple the tag token with the text that follows it, but that sounds unintuitive for the consumer of this API.

@dmsnell
Member

dmsnell commented Jun 7, 2024

By the way, there's no easy way of setting text in a tag without any text nodes.

You're way too fast for me to keep up on this, but it shouldn't be that hard, because we will know where the start and end of a tag is, or if it's an empty tag, we know the full token.

We'll just need to create a function like set_inner_text() which matches the different cases and replaces the appropriate token or tokens to make it happen.

@dmsnell
Member

dmsnell commented Jun 8, 2024

@adamziel I pushed some changes to the decoder to avoid mixing the HTML and XML parsing rules. Sadly, while I thought that PHP's html_entity_decode( $text, ENT_XML1 ) might be sufficient, it allows capital X in a hexadecimal numeric character reference, which is a divergence from the spec.

In mucking around I became aware of how much more the role of errors is going to have to play in an XML API. I don't have any idea what's best. Character encoding failures I would assume are going to be fairly benign as long as we treat those failures as plaintext instead of actually decoding them, but that's a point to ponder.

This is going to get interesting with documents mixing HTML and XML, such as WXR. We're going to need to ensure that the tag and text parsing rules are properly separated. I'm still not sure what that means for us when we find something like a WXR without proper escaping of the content inside.

@adamziel
Contributor Author

You're way too fast for me to keep up on this, but it shouldn't be that hard, because we will know where the start and end of a tag is, or if it's an empty tag, we know the full token.

Streaming makes it a bit more difficult, e.g. we may not have the closer yet, or we may not have the opener anymore. Perhaps pausing on incomplete input before yielding the tag opener would be useful here.

@adamziel
Contributor Author

@adamziel I pushed some changes to the decoder to avoid mixing the HTML and XML parsing rules. Sadly, while I thought that PHP's html_entity_decode( $text, ENT_XML1 ) might be sufficient, it allows capital X in a hexadecimal numeric character reference, which is a divergence from the spec.

Oh dang it! Too bad.

In mucking around I became aware of how much more the role of errors is going to have to play in an XML API. I don't have any idea what's best. Character encoding failures I would assume are going to be fairly benign as long as we treat those failures as plaintext instead of actually decoding them, but that's a point to ponder.

Yes, that struck me too. At minimum, we'll need to communicate at which byte offset the error occurred. Ideally, we'd show the context of the document, highlight the relevant part, and give a highly informative error message.

This is going to get interesting with documents mixing HTML and XML, such as WXR. We're going to need to ensure that the tag and text parsing rules are properly separated. I'm still not sure what that means for us when we find something like a WXR without proper escaping of the content inside.

I'm not sure I follow. By escaping of the content do you mean, say, a missing <![CDATA[ opener, or having an HTML CDATA-lookalike comment inside of an XML CDATA section? There isn't much we can do, other than marking specific tags as PCDATA and stripping the initial CDATA opener and final CDATA closer.

@dmsnell
Member

dmsnell commented Jun 10, 2024

Streaming makes it a bit more difficult, e.g. we may not have the closer yet, or we may not have the opener anymore. Perhaps pausing on incomplete input before yielding the tag opened would be useful here.

My plan with the HTML API is to allow a "soft limit" on memory use. If we need to we can add an additional "hard limit" where it will fail. Should content come in and we're still inside a token, we just keep ballooning past the soft limit until we run out of memory, hit the hard limit, or parse the token.

So in this way I don't see streaming as a problem. The goal is to enable low-latency and low-overhead processing, but if we have to balloon in order to handle more extreme documents we can break the rules as long as it's not too wild.
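A toy sketch of the soft/hard limit idea described above (the class name and thresholds are illustrative, not part of any shipped API):

```php
<?php
// Toy sketch: a buffer that tolerates exceeding a soft limit while a
// token is incomplete, but refuses to balloon past a hard limit.
class Token_Buffer {
	const SOFT_LIMIT = 1024;       // Prefer to stay under this.
	const HARD_LIMIT = 1024 * 64;  // Fail outright past this.

	private $bytes = '';

	public function append( string $chunk ): bool {
		if ( strlen( $this->bytes ) + strlen( $chunk ) > self::HARD_LIMIT ) {
			return false; // Hard failure: refuse to balloon further.
		}
		$this->bytes .= $chunk; // May exceed the soft limit mid-token.
		return true;
	}

	public function is_past_soft_limit(): bool {
		return strlen( $this->bytes ) > self::SOFT_LIMIT;
	}
}
```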

I prototyped this with the 🔔 Notifications on WordPress.com though in a slightly different way. The Notifications Processor has a runtime duration limit and a length limit, counting code points in the text nodes of the processed document. If it hits the runtime duration limit, it stops processing formatting and instead rapidly joins the remaining text nodes as unformatted plaintext. If it hits the length limit it stops processing altogether.

I believe that this Processor framework opens up new avenues for constraints and graceful degradation beyond those limits.

Yes, that struck me too. At minimum, we'll need to communicate on which byte offset the error has occurred. Ideally, we'd show the context of the document, highlight the relevant part, and give a highly informative error message.

This could be an interesting operating mode: when bailing, produce an Elm/Rust-quality error message. The character reference errors make me think we also need some notion of recoverable and non-recoverable errors. A character reference error does not cause syntactic issues, so we can zoom past them if we want, collecting them along the way into a list of errors. Errors for things like invalid tag names, attribute names, etc. are similar.

Missing or unexpected tags though I could see as being more severe since they have ambiguous impact on the rest of the document.
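A minimal sketch of what such an error report could look like (hypothetical helper, not part of this PR): given the document and the byte offset of a parse error, show the offending line with a caret pointing at the error.

```php
<?php
// Hypothetical error formatter: excerpt the line containing the error
// and point a caret at the offending byte offset.
function format_parse_error( string $doc, int $offset, string $message ): string {
	$line_start = strrpos( substr( $doc, 0, $offset ), "\n" );
	$line_start = false === $line_start ? 0 : $line_start + 1;
	$line_end   = strpos( $doc, "\n", $offset );
	$line_end   = false === $line_end ? strlen( $doc ) : $line_end;

	$line  = substr( $doc, $line_start, $line_end - $line_start );
	$caret = str_repeat( ' ', $offset - $line_start ) . '^';

	return "Parse error at byte {$offset}: {$message}\n{$line}\n{$caret}";
}

$doc = "<root>\n<a>&oops;</a>\n</root>";
echo format_parse_error( $doc, 10, 'unsupported entity reference' );
// Parse error at byte 10: unsupported entity reference
// <a>&oops;</a>
//    ^
```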

I'm not sure I follow. By escaping of the content do you mean, say, missing <<CDATA[ opener, or having an HTML CDATA-lookalike comment inside of an XML CDATA section? There isn't much we can do, other than marking specific tags as PCDATA and stripping the initial CDATA opener and final CDATA closer.

Some WXRs I've seen will have something like <content:encoded><![CDATA[<p>This is a post</p>]]></content:encoded>. Others, instead of relying on CDATA, directly encode the HTML (Blogger does this) and it looks like <content:encoded>&lt;p&gt;This is a post&lt;/p&gt;</content:encoded>. The former is more easily recognizable.

I believe that I've seen WXRs that are broken by not encoding the HTML either way, and these are the ones that scare me: <content:encoded><p>This is a post</p></content:encoded>. Or maybe it has been (non-WXR) RSS feeds where I've seen this.

Because the embedded document isn't encoded there's no boundary to detect it. I think this implies that for WordPress, our XML parser could have reason to be itself a blend of XML and HTML. For example:

  • If a tag is known to be an HTML tag interpret it as part of the surrounding text content.
  • Like you brought up, list known WXR or XML tags and treat them differently.
  • Directly encode rules for HTML-containing tags that we see in practice. We could have a list, even a list of breadcrumbs where they may be found.
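For illustration, a hypothetical helper distinguishing the two well-formed `<content:encoded>` shapes described above (CDATA-wrapped vs. entity-encoded). The broken third shape, raw unescaped HTML, has no boundary and cannot be detected this way.

```php
<?php
// Hypothetical helper: unwrap CDATA-wrapped content, or else assume the
// inner markup is entity-encoded and decode the XML-predefined entities.
function decode_wxr_content( string $inner ): string {
	if ( 0 === strpos( $inner, '<![CDATA[' ) && ']]>' === substr( $inner, -3 ) ) {
		return substr( $inner, 9, -3 ); // Strip the CDATA wrapper.
	}
	// Otherwise assume entity-encoded markup.
	return html_entity_decode( $inner, ENT_QUOTES | ENT_XML1, 'UTF-8' );
}
```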

This exploration is quite helpful because I think it's starting to highlight how the shape of WordPress' XML parsing needs differ from those of HTML.

@dmsnell
Member

dmsnell commented Jun 10, 2024

💡 This will generally not be in the most critical performance hot path. We can probably relax some of the excessive optimizations, for example by relying on more functions to separate concepts like parse_tag_name(), parse_attribute_name(), and the like. This modularity would probably aid comprehension, particularly since XML's rules are more constrained than HTML's.

Update: You already had this idea 😆

adamziel added a commit to adamziel/wxr-normalize that referenced this pull request Jul 15, 2024
Brings together a few explorations to stream-rewrite site URLs in a WXR file coming
from a remote server. All of that with no curl, DOMDocument, or other
PHP dependencies. It's just a few small libraries built with WordPress
core in mind:

* [AsyncHttp\Client](WordPress/blueprints#52)
* [WP_XML_Processor](WordPress/wordpress-develop#6713)
* [WP_Block_Markup_Url_Processor](https://github.com/adamziel/site-transfer-protocol)
* [WP_HTML_Tag_Processor](https://developer.wordpress.org/reference/classes/wp_html_tag_processor/)

Here's what the rewriter looks like:

```php
$wxr_url = "https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr";
$xml_processor = new WP_XML_Processor('', [], WP_XML_Processor::IN_PROLOG_CONTEXT);
foreach( stream_remote_file( $wxr_url ) as $chunk ) {
    $xml_processor->stream_append_xml($chunk);
    foreach ( xml_next_content_node_for_rewriting( $xml_processor ) as $text ) {
        $string_new_site_url           = 'https://mynew.site/';
        $parsed_new_site_url           = WP_URL::parse( $string_new_site_url );

        $current_site_url              = 'https://raw.githubusercontent.com/wordpress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/wxr-assets/';
        $parsed_current_site_url       = WP_URL::parse( $current_site_url );

        $base_url = 'https://playground.internal';
        $url_processor = new WP_Block_Markup_Url_Processor( $text, $base_url );

        foreach ( html_next_url( $url_processor, $current_site_url ) as $parsed_matched_url ) {
            $updated_raw_url = rewrite_url(
                $url_processor->get_raw_url(),
                $parsed_matched_url,
                $parsed_current_site_url,
                $parsed_new_site_url
            );
            $url_processor->set_raw_url( $updated_raw_url );
        }

        $updated_text = $url_processor->get_updated_html();
        if ($updated_text !== $text) {
            $xml_processor->set_modifiable_text($updated_text);
        }
    }
    echo $xml_processor->get_processed_xml();
}
echo $xml_processor->get_unprocessed_xml();
```
adamziel added a commit to WordPress/wordpress-playground that referenced this pull request Oct 14, 2024
…ools (#1888)

Let's officially kickoff [the Data
Liberation](https://wordpress.org/data-liberation/) efforts under the
Playground umbrella and unlock powerful new use cases for WordPress.

## Rationale

### Why work on Data Liberation?

WordPress core _really_ needs reliable data migration tools. There's
just no reliable, free, open source solution for:

-   Content import and export
-   Site import and export
- Site transfer and bulk transfers, e.g. mass WordPress -> WordPress, or
Tumblr -> WordPress
-   Site-to-site synchronization

Yes, there's the WXR content export. However, it won't help you backup a
photography blog full of media files, plugins, API integrations, and
custom tables. There are paid products out there, but nothing in core.

At the same time, so many Playground use-cases are **all about moving
your data**. Exporting your site as a zip archive, migrating between
hosts with the [Data Liberation browser
extension](https://github.com/WordPress/try-wordpress/), creating
interactive tutorials and showcasing beautiful sites using [the
Playground
block](https://wordpress.org/plugins/interactive-code-block/),
previewing Pull Requests, building new themes, and [editing
documentation](#1524)
are just the tip of the iceberg.

### Why do the existing data migration tools fall short?

Moving data around seems easy, but it's a complex problem – consider
migrating links.

Imagine you're moving a site from
[https://my-old-site.com](https://playground-site-1.com) to
[https://my-new-site.com/blog/](https://my-site-2.com). If you just
moved the posts, all the links would still point to the old domain so
you'll need an importer that can adjust all the URLs in your entire
database. However, the typical tools like `preg_replace` or `wp
search_replace` can only replace some URLs correctly. They won't
reliably adjust deeply encoded data, such as this URL inside JSON inside
an HTML comment inside a WXR export:
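For illustration (a representative example, not taken from a real export), such nesting might look like:

```xml
<content:encoded><![CDATA[
<!-- wp:image {"url":"https:\/\/my-old-site.com\/wp-content\/uploads\/photo.png"} -->
<figure class="wp-block-image"><img src="https://my-old-site.com/wp-content/uploads/photo.png" /></figure>
<!-- /wp:image -->
]]></content:encoded>
```

Here the URL sits inside JSON (the block attributes), inside an HTML comment (the block delimiter), inside a CDATA section of an XML document.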

The only way to perform a reliable replacement here is to carefully
parse each and every data format and replace the relevant parts of the
URL at the bottom of it. That requires four parsers: an XML parser, an
HTML parser, a JSON parser, a WHATWG URL parser. Most of those tools
don't exist in PHP. PHP provides `json_encode()`, which isn't free of
issues, and that's it. You can't even rely on DOMDocument to parse XML
because of its limited availability and non-streaming nature.

### Why build this in Playground?

Playground gives us a lot for free:

- **Customer-centric environment.** The need to move data around is so
natural in Playground. So many people asked for reliable WXR imports,
site exports, synchronization with git, and the ability to share their
Playground. Playground allows us to get active users and customer
feedback every step of the way.
- **Free QA**. Anyone can share a testing link and easily report any
problems they found. Playground is the perfect environment to get ample,
fast moving feedback.
- **Space to mature the API**. Playground doesn’t provide the same
backward compatibility guarantees as WordPress core. It's easy to
prototype a parser, find a use case where the design breaks down, and
start over.
- **Control over the runtime.** Playground can lean on PHP extensions to
validate our ideas, test them on a simulated slow hardware, and ship
them to a tablet to see how they do when the app goes into background
and the internet is flaky.

Playground enables methodically building spec-compliant software to
create the solid foundation WordPress needs.

## The way there

### What needs to be built?

There's been a lot of [gathering information, ideas, and
tools](https://core.trac.wordpress.org/ticket/60375). This writeup is
based on 10 years worth of site transfer problems, WordPress
synchronization plugins, chats with developers, analyzing existing
codebases, past attempts at data importing, non-WordPress tools,
discussions, and more.

WordPress needs parsers. Not just any parsers, they must be streaming,
re-entrant, fast, standard compliant, and tested using a large body of
possible inputs. The data synchronization tools must account for data
conflicts, WordPress plugins, invalid inputs, and unexpected power
outages. The errors must be non-fatal, retryable, and allow manual
resolution by the user. No data loss, ever. The transfer target site
should be usable as early as possible and show no broken links or images
during the transfer. That's the gist of it.

A number of parsers have already been prototyped. There's even [a draft
of reliable URL rewriting
library](https://github.com/adamziel/site-transfer-protocol). Here's a
bunch of early drafts of specific streaming use-cases:

- [A URL
parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_URL.php)
- [A block markup
parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_Block_Markup_Processor.php)
- [An XML
parser](WordPress/wordpress-develop#6713), also
explored by @dmsnell and @jonsurrell
- [A Zip archive
parser](https://github.com/WordPress/blueprints-library/blob/87afea1f9a244062a14aeff3949aae054bf74b70/src/WordPress/Zip/ZipStreamReader.php)
- [A multihandle HTTP
client](https://github.com/WordPress/blueprints-library/blob/trunk/src/WordPress/AsyncHttp/Client.php)
without curl dependency
- [A MySQL query
parser](WordPress/sqlite-database-integration#157)
started by @zieladam and now explored by @JanJakes
- [A stream chaining
API](adamziel/wxr-normalize#1) to connect all
these pieces

On top of that, WordPress core now has an HTML parser, and @dmsnell has
been exploring a
[UTF-8](WordPress/wordpress-develop#6883)
decoder that would enable fast and regex-less URL detection in long
data streams.

There are still technical challenges to figure out, such as how to pause
and resume the data streaming. As this work progresses, you'll start
seeing incremental improvements in Playground. One possible roadmap is
shipping a reliable content importer, then a reliable site zip importer
and exporter, then cloning a site, and then extending towards
full-featured site transfers and synchronization.

### How soon can it be shipped?

Three points:

* No dates.
* Let's keep building on top of prior work and ship meaningful user
flows often.
* Let's not ship any stable public APIs until the design is mature.

For example, the [Try WordPress
extension](https://github.com/WordPress/try-wordpress/) can already give
you a Playground site, even if you cannot migrate it to another
WordPress site just yet.

**Shipping matters. At the same time, taking the time required to build
rigorous, reliable software is also important**. An occasional early
version of this or that parser may be shipped once its architecture
seems alright, but the architecture and the stable API won't be rushed.
That would jeopardize the entire project. This project aims for a solid
design that will serve WordPress for years.

The progress will be communicated in the open, while maintaining
feedback loops and using the work to ship new Playground features.

## Plans, goals, details

### Next steps

Let's start with building a tool to export and import _a single
WordPress post_. Yes! Just one post. The tricky part is that all the
URLs will have to be preserved.

From there, let's explore the breadth and depth of the problem, e.g.:

* Rewriting links
* Frontloading media files
* Preserving dependent data (post meta, custom tables, etc.)
* Exporting/importing a WXR file using the above
* Pausing and resuming a WXR export/import
* Exporting/importing a full WordPress site as a zip file

Ideally, each milestone will result in a small, readily reusable tool.
For example "paste WordPress post, paste a new site URL, get your post
migrated".

There's an ample body of existing work. Let's keep the existing
codebases (e.g. WXR, site migration plugins) and discussions open in a
browser window during this work. Let's involve the authors of these
tools, ask them questions, ask them for reviews. Let's publish the
progress and the challenges encountered on the way.

### Design goals

- **Fault tolerance** – all the data tools should be able to start,
stop, resume, tolerate errors, accept alternative data from the user,
e.g. media files, posts etc.
- **WordPress-first** – let's build everything in PHP using WordPress
naming conventions.
- **Compatibility** – Every WordPress version, PHP version (7.2+, CLI),
and Playground runtime (web, CLI, browser extension, desktop app, CI
etc.) should be supported.
- **Dependency-free** – No PHP extensions required. If this means we
can't rely on cUrl, then let's build an HTTP client from scratch. Only
minimal Composer dependencies allowed, and only when absolutely
necessary.
- **Simplicity** – no advanced OOP patterns. Our role model is
[WP_HTML_Processor](https://developer.wordpress.org/reference/classes/wp_html_processor/)
– a **single class** that can parse nearly all HTML. There are no "Node",
"Element", or "Attribute" classes. Let's aim for the same here.
- **Extensibility** – Playground should be able to benefit from, say, a
WASM markdown parser even if core WordPress cannot.
- **Reusability** – Each library should be framework-agnostic and usable
outside of WordPress. We should be able to use them in WordPress core,
WP-CLI, Blueprint steps, Drupal, Symfony bundles, non-WordPress tools
like https://github.com/adamziel/playground-content-converters, and even
in Next.js via PHP.wasm.
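The fault-tolerance goal above can be sketched as a cursor-based loop: every batch returns both its results and a small, serializable cursor that the caller can persist and hand back later to resume. Everything here is illustrative (the function name, the array-shaped cursor, the fake post IDs), not a proposed API.

```php
<?php
// Illustrative resumable "export" loop. A real exporter would
// serialize posts and persist the cursor (e.g. as JSON) between runs.
function export_batch( array $post_ids, array $cursor = array( 'index' => 0 ), $batch_size = 2 ) {
	$exported = array();
	$index    = $cursor['index'];
	$end      = min( $index + $batch_size, count( $post_ids ) );
	for ( ; $index < $end; $index++ ) {
		$exported[] = $post_ids[ $index ]; // Real code would export the post here.
	}
	return array(
		'exported' => $exported,
		'cursor'   => array( 'index' => $index ),
		'done'     => $index >= count( $post_ids ),
	);
}

// Simulate stop/resume: run one batch, "restart", resume from the cursor.
$first  = export_batch( array( 10, 11, 12 ) );
$second = export_batch( array( 10, 11, 12 ), $first['cursor'] );
```

The design choice worth noting: because the cursor is plain data rather than live object state, the process can crash, move to another machine, or wait for the user to supply missing media files, and still pick up where it left off.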


### Prior art

Here are a few codebases that need to be reviewed at minimum, and
brought into this project at maximum:

- URL rewriter: https://github.com/adamziel/site-transfer-protocol
- URL detector:
WordPress/wordpress-develop#7450
- WXR rewriter: https://github.com/adamziel/wxr-normalize/
- Stream Chain: adamziel/wxr-normalize#1
- WordPress/wordpress-develop#5466
- WordPress/wordpress-develop#6666
- XML parser: WordPress/wordpress-develop#6713
- Streaming PHP parsers:
https://github.com/WordPress/blueprints-library/tree/trunk/src/WordPress
- Zip64 support (in JS ZIP parser):
#1799
- Local Zip file reader in PHP (seeks to central directory, seeks back
as needed):
https://github.com/adamziel/wxr-normalize/blob/rewrite-remote-xml/zip-stream-reader-local.php
- WordPress/wordpress-develop#6883
- Blocky formats – Markdown <-> Block markup WordPress plugin:
https://github.com/dmsnell/blocky-formats
- Sandbox Site plugin that exports and imports WordPress to/from a zip
file:
https://github.com/WordPress/playground-tools/tree/trunk/packages/playground
- WordPress + Playground CLI setup to import, convert, and export
data: https://github.com/adamziel/playground-content-converters
- Markdown -> Playground workflow _and WordPress plugins_:
https://github.com/adamziel/playground-docs-workflow
- _Edit Visually_ browser extension for bringing data in and out of
Playground: WordPress/playground-tools#298
- _Try WordPress_ browser extension that imports existing WordPress and
non-WordPress sites to Playground:
https://github.com/WordPress/try-wordpress/
- Humanmade WXR importer designed by @rmccue:
https://github.com/humanmade/WordPress-Importer

### Related resources

- [Site transfer protocol](https://core.trac.wordpress.org/ticket/60375)
- [Existing data migration
plugins](https://core.trac.wordpress.org/ticket/60375#comment:32)
- WordPress/data-liberation#74
- #1524
- WordPress/gutenberg#65012

### The project structure

The structure of the `data-liberation` package is an open exploration
and will change multiple times. Here's what it aims to achieve.

**Structural goals:**

- Publish each library as a separate Composer package
- Publish each WordPress plugin separately (perhaps a single plugin
would be the most useful?)
- No duplication of libraries between WordPress plugins
- Easy installation in Playground via Blueprints, e.g. no `composer
install` required
- Compatibility with different Playground runtimes (web, CLI) and
versions of WordPress and PHP
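
The "easy installation in Playground via Blueprints" goal above could look roughly like this – a hypothetical Blueprint that installs a prebuilt plugin zip with no `composer install` on the user's side. The URL is made up and the step schema may have evolved; consult the Playground Blueprints documentation for the current shape:

```json
{
	"landingPage": "/wp-admin/",
	"steps": [
		{
			"step": "installPlugin",
			"pluginZipFile": {
				"resource": "url",
				"url": "https://example.com/data-liberation.zip"
			}
		}
	]
}
```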

**Logical parts**

- First-party libraries, e.g. streaming parsers
- WordPress plugins where those libraries are used, e.g. content
importers
- Third party libraries installed via Composer, e.g. a URL parser

**Ideas:**

- Use Composer dependency graph to automatically resolve dependencies
between libraries and WordPress plugins
- or use WordPress "required plugins" feature to manage dependencies
- or use Blueprints to manage dependencies


cc @brandonpayton @bgrgicak @mho22 @griffbrad @akirk @psrpinto @ashfame
@ryanwelcher @justintadlock @azaozz @annezazu @mtias @schlessera
@swissspidy @eliot-akira @sirreal @obenland @rralian @ockham
@youknowriad @ellatrix @mcsf @hellofromtonya @jsnajdr @dawidurbanski
@palmiak @JanJakes @luisherranz @naruniec @peterwilsoncc @priethor @zzap
@michalczaplinski @danluu
adamziel added a commit to WordPress/wordpress-playground that referenced this pull request Oct 28, 2024
A part of #1894.
Follows up on
#1893.

This PR brings in a few more PHP APIs that were initially explored
outside of Playground so that they can be incubated in Playground. See
the linked descriptions for more details about each API:

* XML Processor from
WordPress/wordpress-develop#6713
* Stream chain from adamziel/wxr-normalize#1
* A draft of a WXR URL Rewriter class capable of rewriting URLs in WXR
files

## Testing instructions

* Confirm the PHPUnit tests pass in CI
* Confirm the test suite looks reasonable
* That's it for now! It's all new code that's not actually used anywhere
in Playground yet. I just want to merge it to keep iterating and
improving.
adamziel added a commit to WordPress/wordpress-playground that referenced this pull request Oct 31, 2024
…essor (#1960)

Merge `WP_XML_Tag_Processor` and `WP_XML_Processor` into a single
`WP_XML_Processor` class. This reduces abstractions, enables keeping
more properties as private, and simplifies the code.

Related to #1894
and WordPress/wordpress-develop#6713

 ## Testing instructions

Confirm the CI tests pass.
@adamziel (Contributor Author) commented:
I ported this work to https://github.com/WordPress/wordpress-playground/ and experimentally merged the two XML processors into a single WP_XML_Processor class. Let's see how it goes.
